1. Introduction

This project aims to detect unusual network activity in a networking system. The original dataset comes from XYZ Bank’s historical log files; a detailed data description can be found below. To distinguish intrusions from benign sessions, we applied four classification methods to our training dataset: Naive Bayes, Random Forest, Boosting, and K-Nearest Neighbours. We then performed cross validation to evaluate the predictive power of each method. To identify different types of intrusions, we conducted K-Means clustering and grouped the intrusions into 3 types based on various combinations of attributes.

In sum, our system provides a holistic approach to detecting intrusions and identifying their types, helping to protect network security and users’ privacy.

2. Data Wrangling

First, we load the network traffic data.

## 'data.frame':    2999 obs. of  23 variables:
##  $ duration          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ protocol_type     : Factor w/ 3 levels "icmp","tcp","udp": 2 2 2 2 2 2 2 2 2 2 ...
##  $ service           : Factor w/ 16 levels "auth","domain_u",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ flag              : Factor w/ 6 levels "REJ","RSTO","RSTR",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ src_bytes         : int  302 339 260 213 308 230 221 329 271 326 ...
##  $ dst_bytes         : int  896 1588 7334 8679 1658 505 445 2431 688 566 ...
##  $ land              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ wrong_fragment    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hot               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_failed_logins : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ logged_in         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ num_compromised   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ root_shell        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ su_attempted      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_root          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_file_creations: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_shells        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_access_files  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_outbound_cmds : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_host_login     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_guest_login    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_intrusion      : int  0 0 0 0 0 0 0 0 0 0 ...

The network dataset contains 22 predictors, which can be divided into discrete and continuous variables. Among the discrete variables, some (protocol_type, service, flag) are multi-categorical with more than 2 levels; the others (land, logged_in, root_shell, su_attempted, is_host_login, is_guest_login) are binary, where 0 means No and 1 means Yes. The dependent variable is is_intrusion: a value of 0 means the session is not an intrusion and 1 means it is.

2.1. Exploratory Data Analysis

This section presents exploratory data analysis with summaries and plots.

## 
##    0    1 
## 2699  300

There are 300 intrusion cases and 2699 benign sessions in this dataset.

There are three protocol types. Intrusion cases appear only in the tcp and udp protocol types; no icmp session is an intrusion.

flag    total  intrusion  intrusion.rate
RSTR       65         65           1.000
S0         33         33           1.000
S3          2          2           1.000
SF       2744        200           0.073
REJ       154          0           0.000
RSTO        1          0           0.000

There are six flag types. The bar plot and the table above show that ALL RSTR, S0, and S3 flag cases turn out to be intrusions, while NO REJ or RSTO flag case is an intrusion.

In this plot, we plotted protocol_type versus flag. Red dots represent intrusion cases, while green dots represent normal cases. The size of each dot represents the number of corresponding cases.

The intrusion distribution is consistent with our previous findings. In addition, we find that all flag types except SF belong to the tcp protocol type.

service   count  count.intrusion  intrusion.rate
ftp          45               33           0.733
private     247              100           0.405
ftp_data    169               67           0.396
http       1911              100           0.052
auth          8                0           0.000
domain_u    185                0           0.000
eco_i        17                0           0.000
ecr_i        10                0           0.000
finger       14                0           0.000
ntp_u        21                0           0.000
other        58                0           0.000
pop_3         6                0           0.000
smtp        290                0           0.000
telnet        3                0           0.000
time          2                0           0.000
urp_i        13                0           0.000

Only ftp, private, ftp_data, and http types of services have intrusion cases.

The size of each circle represents the number of corresponding cases. We find that intrusion cases fail to log in more often.

The scatter plot shows that the duration of the connection is shorter on average when it is an intrusion case.

Intrusion cases tend to have fewer hot indicators.

From this 2-dimensional scatter plot, we find that intrusion cases have more data bytes from source to destination than from destination to source.

Looking through the whole dataset, we also notice that several columns contain only the value 0. The following chunk extracts all such columns.
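A minimal sketch of that chunk, assuming the data frame is named `network` (the name used in the modeling code later in this report):

```r
# Collect the names of columns whose values are all zero
zero.cols <- names(network)[sapply(network, function(col) {
  is.numeric(col) && all(col == 0)
})]
zero.cols
```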

## [1] "land"              "wrong_fragment"    "urgent"           
## [4] "num_failed_logins" "num_outbound_cmds" "is_host_login"

The output shows that the following six columns contain only the value 0: land, wrong_fragment, urgent, num_failed_logins, num_outbound_cmds, is_host_login.

2.2. Data Pre-processing

We have two major objectives for this project: to classify whether an activity is an intrusion, and to cluster intrusions into different types.

We convert categorical variables to factors and continuous variables to numeric. For KNN and K-Means, we create dummy variables for the categorical variables.
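The dummy-variable step can be sketched as follows (the exact chunk is not shown in the report; we assume `network.dummy` is built from the three multi-level factors named in Section 2):

```r
# Expand the three multi-level factors into 0/1 indicator columns
# (with no intercept, the first factor keeps all levels and later
# factors drop one reference level each)
factor.cols <- c("protocol_type", "service", "flag")
dummies <- model.matrix(~ . - 1, data = network[, factor.cols])

# Bind the dummies back onto the remaining (numeric/binary) columns
network.dummy <- cbind(network[, setdiff(names(network), factor.cols)],
                       as.data.frame(dummies))
```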

3. Feature Selection

In this report, we do not use Linear Discriminant Analysis (LDA) or Principal Component Analysis (PCA) for dimension reduction, because our task is to find the abnormal intrusions among all sessions, which are rare cases. With dimension reduction, indicative features are very likely to be lost, so we work with all of the original features.

4. Methodology

We apply Naive Bayes, Random Forest, Boosting, and KNN models to classify sessions as intrusion or benign, and use cross-validation for model selection.

To identify different types of intrusion, we use K-Means method.

4.1. Classification - Detect Intrusions

A major task of our project is to predict whether a network session is an intrusion given predictors such as its duration and protocol type. Since we have the output, y, in our dataset, we can use supervised learning methods to solve this classification problem. We first build Naive Bayes, Random Forest, Boosting, and K-Nearest Neighbors models, and then use cross validation and ROC curves to evaluate the performance of the different models.

4.1.1. Naive Bayes

Confusion Matrix and Error Rate for Naive Bayes
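The matrix below can be produced with a chunk along these lines (an assumption on our part: the klaR package, whose predict method returns the `$posterior` slot used in Section 4.1.5; `train` and `test` are the split used throughout):

```r
library(klaR)  # NaiveBayes()

# Fit on the training rows, then predict the held-out rows
network.nb <- NaiveBayes(is_intrusion ~ ., data = train)
network.nb.y.hat <- predict(network.nb, network[test, ])$class

table(network.nb.y.hat, network[test, ]$is_intrusion)   # confusion matrix
mean(network.nb.y.hat != network[test, ]$is_intrusion)  # error rate
```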

##                 
## network.nb.y.hat   0   1
##                0 549  48
##                1   3   0
## [1] 0.085

In the confusion matrix, predictions are in rows and actual values are in columns. As we can see, the Naive Bayes model correctly classifies 549 benign sessions but 0 intrusions: it misclassifies all 48 intrusions as benign sessions (and 3 benign sessions as intrusions). So even with a relatively low overall error rate of 8.5%, Naive Bayes works poorly at catching intrusions.

4.1.2. Random Forest

## 
## Call:
##  randomForest(formula = is_intrusion ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 4.96%
## Confusion matrix:
##      0   1 class.error
## 0 2147   0   0.0000000
## 1  119 133   0.4722222

From the figure above, we can see that the OOB and class 0 (benign session) error rates are small and quite stable once the number of trees exceeds 30. The class 1 (intrusion) error rate is relatively large and stabilizes once the number of trees exceeds about 100.

Variable Importance Plot
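The table below is the output of `importance()` on the fitted forest; the plot can be drawn with a chunk like this (a sketch, assuming the fitted object is `network.rf`, the name used in Section 4.1.5):

```r
library(randomForest)  # importance(), varImpPlot()

importance(network.rf)   # the importance table shown below
varImpPlot(network.rf)   # the variable importance plot
```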

variable                0       1  MeanDecreaseAccuracy  MeanDecreaseGini
duration               18  22.443                21.283            48.536
protocol_type           8   4.618                 8.768             7.372
service                11  16.496                16.419            56.011
flag                   20  23.023                25.011            66.586
src_bytes              12  19.033                17.959            55.014
dst_bytes               9  11.368                13.828            26.149
land                    0   0.000                 0.000             0.000
wrong_fragment          0   0.000                 0.000             0.000
urgent                  0   0.000                 0.000             0.000
hot                     8  10.475                10.975             7.595
num_failed_logins       0   0.000                 0.000             0.000
logged_in               5  12.830                 7.829            12.450
num_compromised         2   1.001                 2.008             0.083
root_shell              0   0.000                 0.000             0.000
su_attempted            0   0.000                 0.000             0.000
num_root               -3   0.471                -1.549             0.298
num_file_creations      0   1.001                 1.001             0.022
num_shells              0   0.000                 0.000             0.024
num_access_files        0   0.000                 0.000             0.005
num_outbound_cmds       0   0.000                 0.000             0.000
is_host_login           0   0.000                 0.000             0.000
is_guest_login          8  11.802                11.954             7.371

From the variable importance table above, flag, service, src_bytes, duration, and dst_bytes have the five highest Mean Decrease Gini values. They are the most important variables for predicting an intrusion under the Random Forest model.

Confusion Matrix and Error Rate for Random Forest

##                 
## network.rf.y.hat   0   1
##                0 552  18
##                1   0  30
## [1] 0.03

The Random Forest model detects intrusions considerably better than the Naive Bayes model. It correctly recognizes all benign sessions in the test set (552 of 552) and most intrusions (30 of 48). The overall error rate is 3%.

4.1.3. Boosting

## gbm(formula = is_intrusion ~ ., distribution = "multinomial", 
##     data = train, n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with multinomial loss function.
## 5000 iterations were performed.
## There were 22 predictors of which 10 had non-zero influence.

##                                   var      rel.inf
## dst_bytes                   dst_bytes 5.229369e+01
## service                       service 3.411658e+01
## flag                             flag 4.905660e+00
## duration                     duration 3.135220e+00
## protocol_type           protocol_type 3.035872e+00
## src_bytes                   src_bytes 2.456172e+00
## logged_in                   logged_in 5.673620e-02
## is_guest_login         is_guest_login 6.127751e-05
## num_root                     num_root 3.313331e-07
## hot                               hot 8.432320e-10
## land                             land 0.000000e+00
## wrong_fragment         wrong_fragment 0.000000e+00
## urgent                         urgent 0.000000e+00
## num_failed_logins   num_failed_logins 0.000000e+00
## num_compromised       num_compromised 0.000000e+00
## root_shell                 root_shell 0.000000e+00
## su_attempted             su_attempted 0.000000e+00
## num_file_creations num_file_creations 0.000000e+00
## num_shells                 num_shells 0.000000e+00
## num_access_files     num_access_files 0.000000e+00
## num_outbound_cmds   num_outbound_cmds 0.000000e+00
## is_host_login           is_host_login 0.000000e+00

The important variables of the Boosting model are very similar to those of the Random Forest model, though their relative influence is ranked differently. dst_bytes, service, flag, and duration appear among the 5 most important variables in both models.

Confusion Matrix and Error Rate for Boosting
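The hard class labels behind this matrix can be derived from the multinomial gbm’s probabilities along these lines (a sketch; `network.boosting`, `train`, and `test` are the objects used elsewhere in the report):

```r
library(gbm)

# predict() returns an n x (num classes) x 1 array of probabilities;
# pick the class with the higher probability for each test row
probs <- predict(network.boosting, network[test, ],
                 n.trees = 5000, type = "response")
network.bst.y.hat <- colnames(probs)[apply(probs[, , 1], 1, which.max)]

table(network.bst.y.hat, network[test, ]$is_intrusion)
mean(network.bst.y.hat != network[test, ]$is_intrusion)
```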

##                  
## network.bst.y.hat   0   1
##                 0 532   1
##                 1  20  47
## [1] 0.035

The Boosting model identifies almost all intrusions correctly (47 of 48), but its accuracy on benign sessions is lower than that of Naive Bayes and Random Forest (532 of 552). Its overall error rate, 3.5%, is very close to that of Random Forest.

4.1.4. K-Nearest Neighbors

In performing KNN, we use network.dummy as the dataset, in which categorical variables with more than 2 levels are converted into dummy variables.

We find that K = 1 achieves the highest accuracy, so we use K = 1 to develop our model.
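The search over K can be sketched as follows (an assumption: the `class` package’s `knn()`; `network.dummy`, `train`, and `test` as above):

```r
library(class)  # knn()

# Predictor columns are everything except the response
x.cols <- setdiff(names(network.dummy), "is_intrusion")

# Test error for each candidate K
err <- sapply(1:10, function(k) {
  y.hat <- knn(train = network.dummy[train, x.cols],
               test  = network.dummy[test,  x.cols],
               cl    = network.dummy[train, "is_intrusion"], k = k)
  mean(y.hat != network.dummy[test, "is_intrusion"])
})
which.min(err)  # the K with the lowest test error
```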

Confusion Matrix and Error Rate for KNN

##                  
## network.knn.y.hat   0   1
##                 0 532   2
##                 1  20  46
## [1] 0.03666667

The predictability of KNN Model is very close to Boosting. It correctly classifies 532 of 552 benign sessions and 46 of 48 intrusions. Its overall error rate on the test data is 3.67%.

Important Predictors

Next we explore which attributes influence classification. Based on test.result, we found that several attributes have all inputs equal to 0, and we also identified several representative variables for later plotting.

From those plots, we find that some variables play an important role in classification. For example, higher src_bytes is associated with intrusion and higher dst_bytes with no intrusion; flags S0 and RSTR tend to indicate intrusion, while REJ tends to indicate no intrusion.

4.1.5. Model Selection & Evaluation

For this section, we use cross-validation and ROC curve to compare different models and conduct the selection.

Cross Validation

We apply cross validation to the 4 models and select the one with the highest correct rate.
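The loop can be sketched for one of the models, here Random Forest (we assume 10 folds, matching the 10-fold cross validation discussed in Part 5, and the `network` data frame):

```r
library(randomForest)

set.seed(1)
# Assign each row to one of 10 random folds
folds <- sample(rep(1:10, length.out = nrow(network)))

# For each fold: fit on the other 9, score on the held-out fold
acc <- sapply(1:10, function(i) {
  fit <- randomForest(is_intrusion ~ ., data = network[folds != i, ])
  y.hat <- predict(fit, network[folds == i, ])
  mean(y.hat == network[folds == i, ]$is_intrusion)  # fold correct rate
})
mean(acc)  # cross-validated correct rate
```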

Cross validation results are as follows:

Method         Result
Naive Bayes    0.89704
Random Forest  0.95040
Boosting       0.97622
KNN            0.97623

ROC Curve

# Calculate predictor evaluations for Naive Bayes
library(ROCR)  # prediction() and performance() used throughout this chunk
network.nb.score <- predict(network.nb, network[test,])$posterior[,2]
nb.pred <- prediction(network.nb.score, network[test,]$is_intrusion)
nb.perf <- performance(nb.pred, "tpr", "fpr")
nb.auc <- performance(nb.pred, "auc")@y.values[[1]]

# Calculate predictor evaluations for Random Forest
network.rf.score <- predict(network.rf, network[test,], type = "prob")[,2]
rf.pred <- prediction(network.rf.score, network[test,]$is_intrusion)
rf.perf <- performance(rf.pred, "tpr", "fpr")
rf.auc <- performance(rf.pred, "auc")@y.values[[1]]

# Calculate predictor evaluations for Boosting
network.bst.score <- predict(network.boosting, network[test,], 
                             n.trees = 5000, type = "response")[,2,]
bst.pred <- prediction(network.bst.score, network[test,]$is_intrusion)
bst.perf <- performance(bst.pred, "tpr", "fpr")
bst.auc <- performance(bst.pred, "auc")@y.values[[1]]

# Calculate predictor evaluations for KNN
network.knn.score <- predict(network.knn, network.dummy[test,], type = "prob")[,2]
knn.pred <- prediction(network.knn.score, network.dummy[test,]$is_intrusion)
knn.perf <- performance(knn.pred, "tpr", "fpr")
knn.auc <- performance(knn.pred, "auc")@y.values[[1]]

# Create legend of ROC Curve
lgd <- paste(c("Naive Bayes", "Random Forest", "Boosting", "KNN"), "  (AUC:",
             c(round(nb.auc, 6),
               round(rf.auc, 6),
               round(bst.auc, 6),
               round(knn.auc, 6)),
             ")", sep = "")

# Plot ROC Curve
plot(nb.perf, col = 1, lty = 1, lwd=1.5, main = "ROC")
plot(rf.perf, col = 2, lty = 2, lwd=1.5, add = TRUE)
plot(bst.perf, col = 3, lty = 3, lwd=1.5, add = TRUE)
plot(knn.perf, col = 4, lty = 4, lwd=1.5, add = TRUE)
legend("bottomright", legend = lgd,
       col = c(1,2,3,4), lty = c(1,2,3,4), lwd = 1.5)

The ROC curves of all 4 models are close to the optimal curve through the point (0, 1). Random Forest, Boosting, and KNN clearly outperform Naive Bayes, and Random Forest has the highest AUC among the four.

4.2. Identify Intrusion Types (Unsupervised Learning)

4.2.1. K-Means Clustering

Now we perform unsupervised learning via K-Means clustering to identify possible types of network intrusion. The data used are the 300 intrusion observations, with multi-categorical variables converted to dummy variables.
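The clustering step can be sketched as follows (the name `intrusion.dummy` for the 300 dummy-coded intrusion rows is our assumption; `km.out` matches the object summarized below):

```r
set.seed(1)
# Partition the dummy-coded intrusion records into 3 clusters,
# restarting from 20 random initializations and keeping the best
km.out <- kmeans(intrusion.dummy, centers = 3, nstart = 20)
table(km.out$cluster)  # cluster sizes
```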

From the pairs plot, we can observe that the data fall into approximately 3 clusters.

## # A tibble: 3 x 2
##   `km.out$cluster` count
##              <int> <int>
## 1                1   210
## 2                2    60
## 3                3    30

Here we display the pairs plot colored by the 3 learned clusters. The 3 clusters appear reasonable, and several variables clearly help differentiate them. More detailed plots give further insights.

In summary, src_bytes is important for distinguishing the 3 intrusion patterns. Intrusion records with protocol_type udp tend to be grouped into one cluster, as do guest-login intrusion records and records with a higher number of hot indicators. Intrusion records with service ftp or private are grouped into one cluster. Moreover, flags S0 and RSTR tend to clearly split the intrusions into 2 different clusters.

4.2.2. Explore KNN for Classifying Intrusion Patterns

Previously, we used K-Means to detect 3 clusters labeled 1, 2, and 3. We then label benign sessions as cluster 0, giving a dataset of all records with 4 clusters.

## # A tibble: 4 x 2
##   cluster count
##   <chr>   <int>
## 1 0        2699
## 2 1         210
## 3 2          60
## 4 3          30

After labeling all data records, we apply KNN classification and examine the predicted cluster results.

##                          
## network.knn.cluster.y.hat   0   1   2   3
##                         0 531   0   0   0
##                         1   2  44   0   0
##                         2   1   0  19   0
##                         3   0   0   0   3
## [1] 0.005

The pattern prediction appears to be good, with a low error rate of 0.5%.

5. Key Findings

5.1. Findings of Part 2 EDA

In Part 2, we did exploratory data analysis on the original dataset. We detected the following interactions between the response variable is_intrusion and the independent variables:

  • There are three protocol types. Intrusion cases appear only in the tcp and udp protocol types; no icmp session is an intrusion.

  • There are six flag types. ALL RSTR, S0, and S3 flag cases turn out to be intrusions, while NO REJ or RSTO flag case is an intrusion.

  • Intrusion cases fail to log in more often.

  • On average, the duration of the connection is shorter when it is an intrusion case.

  • Intrusion cases tend to have fewer hot indicators.

  • By plotting dst_bytes against src_bytes, we find that intrusion cases have more data bytes from source to destination than from destination to source.

Note: We didn’t control for other variables when doing EDA, so the findings above only help readers gain a clearer picture of the dataset; no solid conclusions can be drawn from them.

To clarify, we did not use LDA/PCA for dimension reduction in this report: intrusions are rare cases, and dimension reduction would very likely discard their indicative features (see Part 3).

5.2. Findings of Part 4 Methodology

In Part 4, we used various data mining methods to address the task problems:

1. Determine if it is possible to differentiate between the labeled intrusions and benign sessions.

  • Yes. This is a typical classification problem with intrusions labeled as is_intrusion = 1 and benign sessions labeled as is_intrusion = 0.

  • In Part 4.1 we applied four models to distinguish intrusions from benign sessions: Naive Bayes, Random Forest, Boosting, and K-Nearest Neighbours.

  • After model selection (see more in Q4), we adopted Boosting and K-Nearest Neighbours in our system.

2. Is it possible to identify different types of intrusions? If so, which values of which attributes in data correlate with the specific types of intrusions?

  • Yes, though the group division of intrusions is unknown, we can identify different types of intrusions via unsupervised learning.

  • In Part 4.2 we first applied K-Means clustering to identify different types of intrusions. Intrusions can be classified into 3 types based on various combinations of attributes. At this point, all data can be categorized into 4 groups: benign sessions and 3 types of intrusions. We then used KNN to train on the data and see how it worked when predicting the four groups. KNN turned out to have a very low misclassification rate of 0.5%, indicating that our K-Means grouping makes sense.

  • After trying out different combinations of attributes, we find that src_bytes is important for classifying the 3 intrusion patterns (Figure 1). Intrusion records with protocol_type udp tend to be grouped into one cluster (Figures 2, 3, 4), as do guest-login intrusion records (Figure 5) and records with a higher number of hot indicators (Figure 2). Intrusion records with service ftp or private are grouped into one cluster (Figure 4). Moreover, flags S0 and RSTR tend to clearly split the intrusions into 2 different clusters (Figure 6).

3. Develop and implement a systematic approach to detect instances of intrusions in log files. Your system will need to be able to take a new network_traffic log file and determine the existence of known patterns of intrusions as well as anomalies which may be indicative of new and unknown intrusion patterns.

We apply a three-step systematic approach to deal with new-coming records:

  • Step 1: detect whether a new activity is an intrusion.

    We applied four classification methods initially; after model selection, we decided to adopt the two top-performing methods, Boosting and KNN (see more in Q4).

    In this setting, the cost of missing an intrusion is higher than that of mistaking a benign session for an intrusion. In other words, we would rather label a suspicious case as an intrusion than let it go.

    Therefore, an activity is identified as an intrusion as long as at least one of the two models predicts an intrusion.

  • Step 2: determine the patterns of intrusions
    After identifying all intrusion records, we feed them into the K-Means model to group them into the 3 types.

  • Step 3: examine whether there are anomalies indicating new patterns
    In Part 2 EDA we found 6 attributes that contain only values of 0. If a new record has a value of 1 (or any other nonzero value) in any of these 6 attributes, it may indicate the emergence of a new type of intrusion pattern.
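Step 3 can be sketched as a simple screen on the six all-zero columns (the column names come from the EDA in Part 2; the function name is ours):

```r
# Columns that were identically zero in the historical data (Part 2 EDA)
zero.cols <- c("land", "wrong_fragment", "urgent",
               "num_failed_logins", "num_outbound_cmds", "is_host_login")

# Flag a new record as a potential new intrusion pattern if any of
# these historically-zero attributes takes a nonzero value
is.anomaly <- function(new.record) any(new.record[zero.cols] != 0)
```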

4. Evaluate detection power of your system.

We used misclassification rate, sensitivity, cross-validation accuracy, and ROC curves to examine the detection power of our classification methods.

  • Figure 7: Random Forest, Boosting, and KNN outperform Naive Bayes in terms of misclassification rate. The misclassification rates of Random Forest, Boosting, and KNN are slightly above 3%, while that of Naive Bayes is 8.5%.

  • Figure 8: As mentioned before, we would rather mistake benign sessions for intrusions than let true intrusions go, so sensitivity is important. Boosting and KNN have the two highest sensitivities, over 95%; Random Forest has a moderate sensitivity of 62.5%; Naive Bayes has a very poor sensitivity of 0. This indicates that Naive Bayes should be excluded from our system.

  • Figure 9: After running 10-fold cross validation, we can see that Boosting and KNN have the highest accuracy, 97.6%; Random Forest also has a very good accuracy of 95%; Naive Bayes has an accuracy of 89.7%, clearly lower than the other three.

  • Figure 10: In terms of ROC curves, the four models are close to each other in performance, with Naive Bayes slightly inferior to the other three.

In conclusion, measured by misclassification rate, sensitivity, cross-validation accuracy, and ROC, the Boosting and KNN methods adopted in our system have good detection power.

5. Can your intrusion detector be used in real time? It would need to receive data about a current session and, within seconds, determine whether it is likely to be an intrusion of a previously seen type or an anomaly potentially signifying an as-yet-unseen intrusion mode. What information should be exchanged via the user interface of such a system?

Yes, it can detect and classify intrusions in real time. The workflow of our system is as follows:

  • Once a new record arrives, it is fed into the Boosting and KNN models; if one or both models predict is_intrusion = 1, it is labeled as an intrusion.

  • If a record is labeled as an intrusion, it is fed into the K-Means model to determine its intrusion type.

  • After every 1000 new records arrive, the system updates itself by re-splitting the training and test data and fitting new models.
